Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

نویسندگان

  • Yufei Xiao
  • Jianping Hua
  • Edward R. Dougherty
چکیده

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution in addition to the variation in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, contributing to over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine

Different approaches have been proposed for feature selection to obtain suitable features subset among all features. These methods search feature space for feature subsets which satisfies some criteria or optimizes several objective functions. The objective functions are divided into two main groups: filter and wrapper methods.  In filter methods, features subsets are selected due to some measu...

متن کامل

Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of...

متن کامل

High-dimensional bolstered error estimation

MOTIVATION In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on thi...

متن کامل

Application of the MoDrY model for the estimation of potato yielding

The study was conducted with the application of the model MoDrY (Model-Dry periods-Yield) for the estimation of the level of potato yields on the basis of dry periods occurring during the particular periods between the phenological phases of the crop plant. A characteristic feature of this model, unlike most existing weatheryield models, is that the principle of its operation is based only ...

متن کامل

Impact of error estimation on feature selection

Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 2007  شماره 

صفحات  -

تاریخ انتشار 2007